Any business wants to maximize the number of customers. To achieve this goal, it is important not only to try to attract new ones, but also to retain existing ones. Retaining a client will cost the company less than attracting a new one. In addition, a new client may be weakly interested in business services and it will be difficult to work with him, while old clients already have the necessary data on interaction with the service.
Accordingly, predicting the churn, we can react in time and try to keep the client who wants to leave. Based on the data about the services that the client uses, we can make him a special offer, trying to change his decision to leave the operator. This will make the task of retention easier to implement than the task of attracting new users, about which we do not know anything yet.
You are provided with a dataset from a telecommunications company. The data contains information about almost six thousand users, their demographic characteristics, the services they use, the duration of using the operator's services, the method of payment, and the amount of payment.
The task is to analyze the data and predict the churn of users (to identify people who will and will not renew their contract). The work should include the following mandatory items:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency
%matplotlib inline
dataset = pd.read_csv("dataset.csv")
dataset.head()
| Unnamed: 0 | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1869 | 7010-BRBUU | Male | 0 | Yes | Yes | 72 | Yes | Yes | No | ... | No internet service | No internet service | No internet service | No internet service | Two year | No | Credit card (automatic) | 24.10 | 1734.65 | No |
| 1 | 4528 | 9688-YGXVR | Female | 0 | No | No | 44 | Yes | No | Fiber optic | ... | Yes | No | Yes | No | Month-to-month | Yes | Credit card (automatic) | 88.15 | 3973.20 | No |
| 2 | 6344 | 9286-DOJGF | Female | 1 | Yes | No | 38 | Yes | Yes | Fiber optic | ... | No | No | No | No | Month-to-month | Yes | Bank transfer (automatic) | 74.95 | 2869.85 | Yes |
| 3 | 6739 | 6994-KERXL | Male | 0 | No | No | 4 | Yes | No | DSL | ... | No | No | No | Yes | Month-to-month | Yes | Electronic check | 55.90 | 238.50 | No |
| 4 | 432 | 2181-UAESM | Male | 0 | No | No | 2 | Yes | No | DSL | ... | Yes | No | No | No | Month-to-month | No | Electronic check | 53.45 | 119.50 | No |
5 rows × 22 columns
dataset = dataset.drop(columns=dataset.columns[0:2])
dataset.head()
| gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 0 | Yes | Yes | 72 | Yes | Yes | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Credit card (automatic) | 24.10 | 1734.65 | No |
| 1 | Female | 0 | No | No | 44 | Yes | No | Fiber optic | No | Yes | Yes | No | Yes | No | Month-to-month | Yes | Credit card (automatic) | 88.15 | 3973.20 | No |
| 2 | Female | 1 | Yes | No | 38 | Yes | Yes | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Bank transfer (automatic) | 74.95 | 2869.85 | Yes |
| 3 | Male | 0 | No | No | 4 | Yes | No | DSL | No | No | No | No | No | Yes | Month-to-month | Yes | Electronic check | 55.90 | 238.50 | No |
| 4 | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | No | Yes | No | No | No | Month-to-month | No | Electronic check | 53.45 | 119.50 | No |
dataset.dtypes
gender object SeniorCitizen int64 Partner object Dependents object tenure int64 PhoneService object MultipleLines object InternetService object OnlineSecurity object OnlineBackup object DeviceProtection object TechSupport object StreamingTV object StreamingMovies object Contract object PaperlessBilling object PaymentMethod object MonthlyCharges float64 TotalCharges float64 Churn object dtype: object
dataset["SeniorCitizen"] = dataset["SeniorCitizen"].map({0: "No", 1: "Yes"})
dataset.head()
| gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | No | Yes | Yes | 72 | Yes | Yes | No | No internet service | No internet service | No internet service | No internet service | No internet service | No internet service | Two year | No | Credit card (automatic) | 24.10 | 1734.65 | No |
| 1 | Female | No | No | No | 44 | Yes | No | Fiber optic | No | Yes | Yes | No | Yes | No | Month-to-month | Yes | Credit card (automatic) | 88.15 | 3973.20 | No |
| 2 | Female | Yes | Yes | No | 38 | Yes | Yes | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Bank transfer (automatic) | 74.95 | 2869.85 | Yes |
| 3 | Male | No | No | No | 4 | Yes | No | DSL | No | No | No | No | No | Yes | Month-to-month | Yes | Electronic check | 55.90 | 238.50 | No |
| 4 | Male | No | No | No | 2 | Yes | No | DSL | Yes | No | Yes | No | No | No | Month-to-month | No | Electronic check | 53.45 | 119.50 | No |
def check_missing(dataset):
missing = False
for col in dataset.columns:
m = dataset[col].isna().sum()
if m > 0:
print(f"Column {col} has {m} missing values.")
missing = True
if not missing:
print("There aren't any missing values in the dataset.")
check_missing(dataset)
Column TotalCharges has 10 missing values.
fig, axes = plt.subplots(nrows=3, ncols=1,figsize=(12,10))
sns.boxplot(x='tenure',data=dataset,ax=axes[0])
sns.boxplot(x='MonthlyCharges',data=dataset,ax=axes[1], color='orange')
sns.boxplot(x='TotalCharges',data=dataset,ax=axes[2], color='green')
plt.show()
print("Total Number:")
print(dataset['gender'].value_counts(),"\n")
Total Number: Male 3050 Female 2936 Name: gender, dtype: int64
def group_by_val_counts(dataset, col1, col2):
group = dataset.groupby(col1)[col2].value_counts()
print(group)
def group_by_mean(dataset, col1, col2):
group = dataset.groupby(col1)[col2].mean()
print(group)
# How many customers are married?
group_by_val_counts(dataset, "gender", "Partner")
gender Partner
Female No 1500
Yes 1436
Male No 1582
Yes 1468
Name: Partner, dtype: int64
# How many customers are senior citizens?
group_by_val_counts(dataset, "gender", "SeniorCitizen")
gender SeniorCitizen
Female No 2454
Yes 482
Male No 2566
Yes 484
Name: SeniorCitizen, dtype: int64
# How many customers have dependents?
group_by_val_counts(dataset, "gender", "Dependents")
gender Dependents
Female No 2070
Yes 866
Male No 2125
Yes 925
Name: Dependents, dtype: int64
# What are the mean monthly charges for senior citizens vs. regular citizens,
# and what are the mean charges for all customers?
group_by_mean(dataset, "SeniorCitizen", "MonthlyCharges")
st = dataset.MonthlyCharges.mean()
print(f"\nDataset average: {st}")
SeniorCitizen No 61.904273 Yes 79.861905 Name: MonthlyCharges, dtype: float64 Dataset average: 64.80221349816237
# What are the mean monthly charges for each contract type,
# and how many contracts are there from each type?
group_by_mean(dataset, "Contract", "MonthlyCharges")
print("\n")
st = dataset.Contract.value_counts()
print(st)
Contract Month-to-month 66.305782 One year 65.148589 Two year 61.084559 Name: MonthlyCharges, dtype: float64 Month-to-month 3269 Two year 1441 One year 1276 Name: Contract, dtype: int64
For each of the following tests, we'll reject the null hypothesis if p-value > 0.05.
The null hypothesis for each of the following tests is
H0: The two tested variables are independent.
crosstab = pd.crosstab(index=dataset['InternetService'], columns=dataset['OnlineSecurity'])
print(crosstab)
st = chi2_contingency(crosstab)
print("\nThe P-Value of the ChiSq Test is: ", st[1])
OnlineSecurity No No internet service Yes InternetService DSL 1072 0 996 Fiber optic 1910 0 717 No 0 1291 0 The P-Value of the ChiSq Test is: 0.0
Since p-value is 0.0 then we reject the null hypothesis, they aren't independent variables.
We can conclude that features:
-OnlineSecurity
-OnlineBackup
-DeviceProtection
-TechSupport
-StreamingTV
-StreamingMovies
and InternetService are dependent fully with same p-value =0.0
crosstab = pd.crosstab(index=dataset['InternetService'], columns=dataset['Contract'])
print(crosstab)
st = chi2_contingency(crosstab)
print("\nThe P-Value of the ChiSq Test is: ",st[1])
Contract Month-to-month One year Two year InternetService DSL 1041 493 534 Fiber optic 1790 469 368 No 438 314 539 The P-Value of the ChiSq Test is: 1.3172465982259826e-107
p-value is not greater than 0.05, so we reject the null hypothesis. Therefore, InternetService and Contract aren't independent features.
crosstab = pd.crosstab(index=dataset['InternetService'], columns=dataset['PhoneService'])
print(crosstab)
st = chi2_contingency(crosstab)
print("\nThe P-Value of the ChiSq Test is: ", st[1])
PhoneService No Yes InternetService DSL 590 1478 Fiber optic 0 2627 No 0 1291 The P-Value of the ChiSq Test is: 5.3931701879801185e-270
p-value is not greater than 0.05, so we reject the null hypothesis. Therefore, InternetService and PhoneService aren't independent features.
crosstab = pd.crosstab(index=dataset['InternetService'], columns=dataset['MultipleLines'])
print(crosstab)
st = chi2_contingency(crosstab)
print("\nThe P-Value of the ChiSq Test is: ", st[1])
MultipleLines No No phone service Yes InternetService DSL 886 590 592 Fiber optic 965 0 1662 No 997 0 294 The P-Value of the ChiSq Test is: 0.0
p-value is not greater than 0.05, so we reject the null hypothesis. Therefore, InternetService and MultipleLines aren't independent features.
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(12,7), sharey='all')
dataset[["gender"]].value_counts().plot(ax=axes[0], kind='bar', rot=0, color="yellow")
dataset[["SeniorCitizen"]].value_counts().plot(ax=axes[1], kind='bar',rot=0, color="aqua")
dataset[["Partner"]].value_counts().plot(ax=axes[2], kind='bar',rot=0, color="lightgreen")
dataset[["Churn"]].value_counts().plot(ax=axes[3], kind='bar',rot=0, color="purple")
plt.show()
plt.figure(figsize=(10,7))
sns.scatterplot(x=dataset['MonthlyCharges'],y=dataset['TotalCharges'], hue='Churn', data=dataset)
plt.show()
plt.figure(figsize=(10,7))
sns.scatterplot(x=dataset['tenure'],y=dataset['TotalCharges'], hue= "Churn", data=dataset)
plt.show()
There is some pattern, tenure is proportional with TotalCharges. However, we see that small gap between tenure's values [45,60+]
plt.figure(figsize=(10,7))
sns.scatterplot(x = "tenure", y = "MonthlyCharges", hue = "Churn", data = dataset)
plt.show()
avg_monthly = dataset["MonthlyCharges"].mean()
monthly_centered = dataset["MonthlyCharges"] - avg_monthly
plt.figure(figsize = (12, 6))
# Remove .axhline() to remove the black line in the middle
(sns.scatterplot(y = monthly_centered.values, x = monthly_centered.index, hue = dataset.Churn)
.axhline(0, color = "black", linewidth = 2))
plt.show()
# The following scatterplot shows how much customers pay above or below the average monthly charges, and whether they Churn or not.
Here we can see that customers that pay higher than average rates tend to leave the company more rather than customers that pay less than average.
plt.figure(figsize = (7, 7))
sns.barplot(x = "Partner", y = "MonthlyCharges", data = dataset)
plt.show()
plt.figure(figsize = (7, 7))
sns.barplot(x = "Churn", y = "tenure", data = dataset)
plt.show()
From the data above it seems that people with longer tenure renew their contracts much more.
fig, axs = plt.subplots(1, 3, figsize=(15, 5), sharey='all')
sns.histplot(data=dataset, x="MonthlyCharges", kde=True, ax=axs[0])
sns.histplot(data=dataset, x="TotalCharges", kde=True, color="orange", ax=axs[1])
sns.histplot(data=dataset, x="tenure", kde=True, color="green", ax=axs[2])
plt.show()
fig, axs = plt.subplots(1, 3, figsize=(15, 5), sharey='all')
sns.kdeplot(data = dataset["MonthlyCharges"],shade=True, ax=axs[0])
sns.kdeplot(data = dataset["TotalCharges"],shade=True, ax=axs[1], color = 'orange')
sns.kdeplot(data = dataset["tenure"],shade=True, ax=axs[2], color = 'green')
plt.show()
st = dataset.groupby(["Contract", "Churn"]).Churn.agg([len])
st = st.reset_index().rename(columns = {"len": "Count"})
plt.figure(figsize = (18, 12))
sns.catplot(x="Contract", y="Count", hue="Churn", kind="bar", data=st)
plt.show()
st
<Figure size 1296x864 with 0 Axes>
| Contract | Churn | Count | |
|---|---|---|---|
| 0 | Month-to-month | No | 1871 |
| 1 | Month-to-month | Yes | 1398 |
| 2 | One year | No | 1128 |
| 3 | One year | Yes | 148 |
| 4 | Two year | No | 1400 |
| 5 | Two year | Yes | 41 |
The data above shows that custumers with month-to-month contracts tend to churn much more, where as custumers with two year contracts leave the company the least.
fig, axs = plt.subplots(1, 3, figsize=(12, 7), sharey='all')
pd.crosstab(index=dataset['SeniorCitizen'],columns=dataset['PaymentMethod']).plot(kind='bar', ax=axs[0],rot=0)
pd.crosstab(index=dataset['InternetService'],columns=dataset['PaymentMethod']).plot(kind='bar', ax=axs[1],rot=0)
pd.crosstab(index=dataset['Churn'],columns=dataset['PaymentMethod']).plot(kind='bar', ax=axs[2],rot=0)
plt.show()
sns.heatmap(dataset[['tenure', 'MonthlyCharges', 'TotalCharges']].corr(),
vmax=1, vmin=-1,annot=True,)
<AxesSubplot:>
# NaN values are continous variables
dataset.describe(include='all').loc['unique', :]
gender 2 SeniorCitizen 2 Partner 2 Dependents 2 tenure NaN PhoneService 2 MultipleLines 3 InternetService 3 OnlineSecurity 3 OnlineBackup 3 DeviceProtection 3 TechSupport 3 StreamingTV 3 StreamingMovies 3 Contract 3 PaperlessBilling 2 PaymentMethod 4 MonthlyCharges NaN TotalCharges NaN Churn 2 Name: unique, dtype: object
dataset = dataset.reset_index(drop=True)
dataset.to_csv("dataset1.csv")